
feat(query): RowBinaryWithNamesAndTypes for enhanced type safety #221


Open · wants to merge 21 commits into main

Conversation

@slvrtrn slvrtrn (Contributor) commented May 20, 2025

Summary

Warning

This is a work in progress implementation and may change significantly.
It implements RBWNAT for Query only; Insert should be a new PR.

First of all, let's abbreviate RowBinaryWithNamesAndTypes format as RBWNAT, and the regular RowBinary as just RB for simplicity.

There is a significant number of issues in the repository regarding schema incompatibility or obscure error messages (see the full list below). The reason is that deserialization is effectively implemented in a "data-driven" way, where the user's structures dictate how the RB stream should be (de)serialized. This makes it possible to have a hiccup where, say, two UInt32 values are deserialized as a single UInt64, which in the worst case leads to corrupted data. For example:

This test deserializes a wrong value on the main branch, because DateTime64 is streamed as 8 bytes (Int64), and 2x (U)Int32 are also streamed as 8 bytes in total. On this branch, with validation mode enabled, it now correctly throws an error.

#[tokio::test]
#[cfg(feature = "time")]
async fn test_serde_with() {
    #[derive(Debug, Row, Serialize, Deserialize, PartialEq)]
    struct Data {
        #[serde(with = "clickhouse::serde::time::datetime64::millis")]
        n1: OffsetDateTime, // underlying is still Int64; should not compose it from two (U)Int32
    }

    let client = prepare_database!().with_struct_validation_mode(StructValidationMode::EachRow);
    let result = client
        .query("SELECT 42 :: UInt32 AS n1, 144 :: Int32 AS n2")
        .fetch_one::<Data>()
        .await;

    assert!(result.is_err());
    assert!(matches!(
        result.unwrap_err(),
        Error::InvalidColumnDataType { .. }
    ));
}

This PR introduces:

  • RBWNAT format usage instead of RB, which allows for stronger type safety guarantees. This is regulated by the StructValidationMode client option, which has two possible modes:
    • First(1) (default) - uses RBWNAT and checks the types for the first row only, so it retains most of the performance compared to the Disabled mode, while still providing significantly stronger guarantees.
    • EachRow - uses RBWNAT and every single row is validated. It is expected to be significantly slower than the default mode.
  • A new internal types crate with utilities for parsing RBWNAT and Native data type strings into a proper AST. Rustified from https://github.com/ClickHouse/clickhouse-js/blob/main/packages/client-common/src/parse/column_types.ts, but not entirely. The most important parts are correctness and the tests; the actual implementation details can be adjusted in a follow-up.
  • The ability to conveniently deserialize a Map as a HashMap<K, V>, and not only as a Vec<(K, V)>, which was confusing (see the sketch after this list).
  • Clearer error messages for schema mismatch.
  • A lot of tests, with more to come, especially for difficult corner cases (nested nullables, multi-dimensional mixed arrays/maps/tuples/enums, etc.).
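
As a quick illustration of how the first and third points above look from the caller's side, here is a minimal sketch. This is not code from the PR: the Map column shape is made up, and it assumes StructValidationMode is re-exported at the crate root.

use std::collections::HashMap;

use clickhouse::{Client, Row, StructValidationMode};
use serde::Deserialize;

#[derive(Debug, Row, Deserialize)]
struct Stats {
    // With RBWNAT, a Map(String, UInt32) column can be read directly into a
    // HashMap instead of a Vec<(String, u32)>.
    counters: HashMap<String, u32>,
}

async fn fetch_stats(client: Client) -> clickhouse::error::Result<Vec<Stats>> {
    // EachRow validates every row against the RBWNAT header;
    // the default mode validates the first row only.
    let client = client.with_struct_validation_mode(StructValidationMode::EachRow);
    client
        .query("SELECT map('a', 1, 'b', 2) :: Map(String, UInt32) AS counters")
        .fetch_all::<Stats>()
        .await
}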

Likely possible to implement:

  • Support for "shuffled" structure definitions, where the order of the fields does not match the DB, but the names and types are correct; it should be possible by leveraging (perhaps optionally) the visit_map API available for deserialize_struct instead of the current visit_seq, which processes a struct as a tuple.

Source files to look at:

Current benchmark results

Select numbers

This branch:

compress  validation  elapsed  throughput  received
    none   FirstN(1)   1.474s  2587 MiB/s  3815 MiB
    none        Each   3.053s  1250 MiB/s  3815 MiB

Main branch:

compress  elapsed  throughput  received
    none   1.296s  2943 MiB/s  3815 MiB

Still losing a bit when validating only the first record. Each row validation mode, as expected, is significantly slower. But the NYC taxi data (a more real-world scenario, since no one streams system.numbers, I guess...) shows totally different and very promising results.

NYC taxi data

This branch:

compress  validation    elapsed  throughput  received
    none   FirstN(1)  939.352ms   361 MiB/s   339 MiB
     lz4   FirstN(1)  950.834ms   357 MiB/s   151 MiB
    none        Each  988.465ms   343 MiB/s   339 MiB
     lz4        Each     1.186s   286 MiB/s   151 MiB

Main branch:

compress    elapsed  throughput  received
    none  939.392ms   361 MiB/s   339 MiB
     lz4  943.551ms   360 MiB/s   151 MiB

The difference is, in fact, not that great, especially considering the benefits the Each validation mode provides. Perhaps it is not a bad idea to use Each as the default mode instead of First(1)?

Issues overview

Note

If an issue is checked in the list, that means there is also a test that demonstrates proper error messages in case of schema mismatch.

Resolved issues

Related issues

Previously closed issues with unclear error messages

Follow-up issues

@mshustov mshustov requested review from Copilot and loyd May 20, 2025 12:36

@slvrtrn slvrtrn (Contributor, Author) left a comment

Added a few comments regarding the intermediate implementation.

}
let result = String::from_utf8_lossy(&buffer.copy_to_bytes(length)).to_string();
Ok(result)
}
slvrtrn (Contributor, Author):

More or less the same as the implementation in the deserializer. Perhaps, as a follow-up, all the reader logic can be extracted into similar functions with #[inline(always)]?
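
A hedged sketch of what such an extracted helper might look like; the name and the Buf-based signature are illustrative, not taken from this PR:

use bytes::Buf;

// Hypothetical helper mirroring the snippet above: read `length` bytes and
// convert them to a String, replacing invalid UTF-8 sequences.
#[inline(always)]
fn read_utf8_string(buffer: &mut impl Buf, length: usize) -> String {
    String::from_utf8_lossy(&buffer.copy_to_bytes(length)).to_string()
}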


#[error("Type parsing error: {0}")]
TypeParsingError(String),
}
slvrtrn (Contributor, Author):

Needs revising.

0 => visitor.visit_some(&mut RowBinaryDeserializer {
    input: self.input,
    validator: inner_data_type_validator,
}),
1 => visitor.visit_none(),
slvrtrn (Contributor, Author):

This is currently the main drawback of the validation implementation if we want to disable it after the first N rows for better performance. If these first rows are all NULLs, then we do not properly validate the inner type.
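
A hypothetical illustration of that gap (not a test from this PR; client is assumed to be an already configured Client using the default first-row validation):

#[derive(Debug, Row, Deserialize)]
struct Data {
    // The Rust side expects an Int64, but the column below is Nullable(UInt32).
    // Since the first produced row is NULL, only the null-flag byte is read for it,
    // so first-row-only validation never sees the inner UInt32 and the mismatch
    // goes undetected.
    n: Option<i64>,
}

let _rows = client
    .query("SELECT arrayJoin([NULL, 42]) :: Nullable(UInt32) AS n")
    .fetch_all::<Data>()
    .await;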

@slvrtrn slvrtrn changed the title from "PoC (Query): RowBinaryWithNamesAndTypes for enhanced type safety" to "feat (query): RowBinaryWithNamesAndTypes for enhanced type safety" on May 29, 2025
@slvrtrn slvrtrn changed the title from "feat (query): RowBinaryWithNamesAndTypes for enhanced type safety" to "feat(query): RowBinaryWithNamesAndTypes for enhanced type safety" on May 29, 2025
return Ok(());
}
Ok(_) => {
// TODO: or panic instead?
Member:

Returning an error when we're already returning a Result seems correct.

@slvrtrn slvrtrn (Contributor, Author) May 30, 2025:

It does not make sense to handle this error and continue if we cannot even parse the column header with names and types. That means everything has gone entirely wrong, and it should be unreachable... unless there are some odd network/LB issues?

Member:

It still seems wrong to panic over bytes controlled by another actor.

assert_eq!(actual, sample());
}
}
// #[test]
Member:

intend to restore this?

slvrtrn (Contributor, Author):

Will restore, but not sure if we need this test with so many integration tests that do essentially the same thing.


shift += 7;
if shift > 57 {
// TODO: what about another error?
Member:

what's the rationale behind 57?


.expect("failed to fetch string");
assert_eq!(result, "\x01\x02\x03\\ \"\'");
//
// let result = client
Member:

intended to restore?

slvrtrn (Contributor, Author):

Thanks for noticing. Don't know why it was commented out.

@slvrtrn slvrtrn requested a review from Copilot May 30, 2025 10:02
@slvrtrn slvrtrn marked this pull request as ready for review May 30, 2025 10:02
@Copilot Copilot AI left a comment

Pull Request Overview

This PR introduces support for the RowBinaryWithNamesAndTypes (RBWNAT) format for enhanced type safety in query deserialization, along with new validation modes and improvements in error messages and benchmarks. Key changes include:

  • Adding new macros and tests to assert panic conditions on schema mismatches.
  • Refactoring query execution to use RBWNAT and propagating a client-wide validation mode.
  • Enhancements to serialization/deserialization, including a new utility for LEB128 encoding (see the sketch below) and improved columns header parsing.
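
For reference, LEB128 here is the standard unsigned varint encoding ClickHouse uses for lengths and counts. A minimal sketch of such a writer (the actual put_leb128 in this PR may differ in name and signature):

use bytes::BufMut;

// Write `value` as an unsigned LEB128 varint: 7 bits per byte,
// with the high bit set on every byte except the last.
fn put_leb128(buf: &mut impl BufMut, mut value: u64) {
    loop {
        let byte = (value & 0x7f) as u8;
        value >>= 7;
        if value == 0 {
            buf.put_u8(byte);
            break;
        }
        buf.put_u8(byte | 0x80);
    }
}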

Reviewed Changes

Copilot reviewed 32 out of 32 changed files in this pull request and generated 1 comment.

File                      Description
tests/it/main.rs          Added macros to assert panics during fetch operations.
tests/it/insert.rs        Updated table creation and row type in the rename insert test.
tests/it/cursor_stats.rs  Adjusted expected decoded byte count to account for the RBWNAT header.
tests/it/cursor_error.rs  Revised error handling and test scenarios for query timeouts.
src/validation_mode.rs    Introduced new validation mode enum with documentation.
src/rowbinary/ser.rs      Switched to using put_leb128 and replaced an error return with a panic.
src/cursors/row.rs        Implemented async header reading and conditionally validated rows.
examples/mock.rs          Updated the mock provide handler to include schema information.
benches/select_*.rs       Updated benchmarks to pass the validation mode to the client.
Cargo.toml                Updated bench configuration and dependency versions.
Comments suppressed due to low confidence (1)

src/rowbinary/tests.rs:117

  • The test 'it_deserializes' is commented out, which may reduce test coverage for deserialization; please re-enable or provide context for the deactivation.
// #[test]
// fn it_deserializes() { ... }

return Err(Error::VariantDiscriminatorIsOutOfBound(
    variant_index as usize,
));
panic!("max number of types in the Variant data type is 255, got {variant_index}")
Copilot AI May 30, 2025:

Instead of panicking when the variant index exceeds 255, consider returning a proper error to allow for graceful error handling.

slvrtrn (Contributor, Author):

It does not make sense to handle this error; it means the entire deserialization went wrong, because the server ensures at most 255 variants per type.


ilidemi commented May 30, 2025

Some comments from Claude Code (could be complete slop, I know little about Rust):

General

1. Panic-Driven Error Handling is Unacceptable
The validation system at src/rowbinary/validation.rs:68-82 uses panic! for schema mismatches. This is fundamentally wrong in a library:

  • Library rule #1: Never panic on user input
  • Creates unrecoverable failures for recoverable errors
  • Violates Rust's error handling principles
  • Makes debugging extremely difficult in production

Recommendation: Convert all panic! calls to proper Result<T, ValidationError> returns.

2. API Breaking Changes Without Semver

  • Query::fetch() now uses RowBinaryWithNamesAndTypes instead of RowBinary (line 90 in query.rs)
  • RowCursor::new() signature changed to require ValidationMode
  • This changes the wire protocol - a breaking change disguised as a feature addition

Performance optimizations

🔥 High-Impact Optimizations

1. Eliminate Allocations in Error Paths

Location: src/rowbinary/validation.rs:46-51

// Current: Allocates strings in panic paths
format!("{}.{}", self.get_struct_name(), c.name)
"Struct".to_string()

// Better: Use Cow<str> for zero-allocation error messages
use std::borrow::Cow;
fn get_current_column_name(&self) -> Cow<'static, str> {
    // avoid format! allocation
}

2. Optimize String Deserialization

Location: src/rowbinary/de.rs:67, 184

// Current: Always allocates Vec for String
fn read_vec(&mut self, size: usize) -> Result<Vec<u8>> {
    Ok(self.read_slice(size)?.to_vec())  // ❌ Always allocates
}

// Better: Only allocate when necessary
fn deserialize_string<V: Visitor<'data>>(self, visitor: V) -> Result<V::Value> {
    let slice = self.read_slice(size)?;
    match str::from_utf8(slice) {
        Ok(s) => visitor.visit_borrowed_str(s),  // Zero-copy!
        Err(_) => {
            let string = String::from_utf8_lossy(slice).into_owned();
            visitor.visit_string(string)  // Only allocate for invalid UTF-8
        }
    }
}

3. Branch Prediction Optimization in Validation

Location: src/cursors/row.rs:96-104

// Current: Pattern matching on validation count
let (result, not_enough_data) = match self.rows_to_validate {
    0 => rowbinary::deserialize_from::<T>(&mut slice, &[]),
    u64::MAX => rowbinary::deserialize_from::<T>(&mut slice, &self.columns),
    _ => { /* ... */ }
};

// Better: Likely/unlikely hints for branch predictor
let (result, not_enough_data) = if likely(self.rows_to_validate > 0) {
    if self.rows_to_validate == u64::MAX {
        rowbinary::deserialize_from::<T>(&mut slice, &self.columns)
    } else {
        self.rows_to_validate -= 1;
        rowbinary::deserialize_from::<T>(&mut slice, &self.columns)
    }
} else {
    rowbinary::deserialize_from::<T>(&mut slice, &[])
};

🎯 Medium-Impact Optimizations

4. Validation State Caching

Location: src/rowbinary/validation.rs:87-90

// Current: Validates every field access
self.validator.validate(serde_type)?;

// Better: Cache validation results for repeated patterns
struct CachedValidator {
    last_column_idx: usize,
    last_validation: Option<ValidatedState>,
}

5. SIMD-Optimized Size Checks

Location: src/rowbinary/de.rs:89

// Current: Individual size checks
ensure_size(&mut self.input, core::mem::size_of::<$ty>())?;

// Better: Batch size checks for multiple fields
fn ensure_sizes_batch(input: &[u8], sizes: &[usize]) -> Result<()> {
    // SIMD-optimized batch boundary checking
}

6. Avoid Repeated Column Lookups

Location: src/rowbinary/validation.rs:42-52

// Current: String formatting on every error
format!("{}.{}", self.get_struct_name(), c.name)

// Better: Pre-format common error prefixes
struct ErrorContext {
    column_prefix: String,  // Pre-computed once per struct
}

@slvrtrn slvrtrn (Contributor, Author) commented May 30, 2025

@ilidemi, thanks. Here are some comments on that:

  1. Panic-Driven Error Handling is Unacceptable

It is more than acceptable here. It panics in case of an invalid struct definition in the code; there is no reason to continue (de)serializing junk. It is unsafe, and the Rust guidelines explicitly say the following about trying to continue with incorrect values:

The panic! macro signals that your program is in a state it can’t handle and lets you tell the process to stop instead of trying to proceed with invalid or incorrect values.

I'd say we have exactly this situation. See the explanation about Result:

The Result enum uses Rust’s type system to indicate that operations might fail in a way that your code could recover from.

It is not possible to recover a program that uses the crate from an incorrect struct definition. It must be fixed by the user.

  2. API Breaking Changes Without Semver
    Query::fetch() now uses RowBinaryWithNamesAndTypes instead of RowBinary (line 90 in query.rs)
    RowCursor::new() signature changed to require ValidationMode
    This changes the wire protocol - a breaking change disguised as a feature addition

Well, RBWNAT instead of RB is the intention. It will also go out as 0.14.0, where breaking changes are actually allowed. RowCursor is private to the end user (its visibility is pub(crate)), so it does not matter.

  3. Optimize String Deserialization

Worth looking into.

  4. Branch Prediction Optimization in Validation

Hints are unstable, and we cannot use them. But RowCursor became ~20% slower, that is a fact, and ideally we need to find a way to reduce the overhead; I haven't found one yet.

  5. Validation State Caching

It already validates only one array value and one key-value pair of a Map.

  6. SIMD-Optimized Size Checks

Worth checking indeed.

  7. Eliminate Allocations in Error Paths
  8. Avoid Repeated Column Lookups

There are no errors, only panics, so it does not really matter IMO.


ilidemi commented May 31, 2025

25% hit rate, definitely space for improvement 😊

Got a few more from Opus 4:

  1. On panic vs Result - database could be migrated from underneath the app (or all apps updated but one). The app wouldn't be able to access the data anyway, but there's a difference between failing one path and crashing the process.
  2. ensure_size could be #[inline(always)], although the compiler would likely figure it out
  3. Save on runtime branches for validation on every row by using static generics to distinguish fast path from slow path. Although it agreed that the branch predictor would likely catch on when they stop being taken many times in a row.
